Abstract - The ability to express a program as a hierarchical
composition of parts is an essential tool in managing the
complexity of software, and a key abstraction this provides is
the separation of the representation of data from the computation.
Many current parallel programming models use a shared
memory model to provide data abstraction, but this does not
scale well with large numbers of cores due to non-determinism
and access latency. This paper proposes a simple programming
model that allows scalable parallel programs to be expressed
with distributed representations of data, and it provides the
programmer with the flexibility to employ shared or distributed
styles of data-parallelism where applicable. It admits an
efficient implementation and, with the provision of a small set
of primitive capabilities in the hardware, it can be compiled to
operate directly on the hardware, in the same way stack-based
allocation operates for subroutines in sequential machines.

Keywords - Parallel programming, composability, parallel subroutines,
data-parallelism, distributed memory, compilation techniques.

I. Introduction 

When developing a program of any complexity, the ability 
to express it in terms of a simpler set of components is 
essential. A component presents a simple interface that 
allows its implementation to be considered independently, 
and when combined with other components, the internal 
details can be ignored and its functionality treated in an 
abstract way. This allows a program to be constructed using 
modules, ranging from small functions to libraries, and for 
any component to be substituted with another that adheres 
to the same interface. The importance of abstraction as a 
tool in computer programming was recognised by Turing 
in the 1940s [1] and was formalised in the 1970s by the
structured programming methodology [2]. This aimed to
improve the quality of programs and productivity of pro- 
grammers through judicious use of hierarchical structuring 
and subroutines. These principles have been foundational for 
modern sequential programming languages. 

A key issue with composability is separating the rep- 
resentation of data from the structure of a computation. 
Mainstream CPU and general-purpose GPU (GPGPU) par- 
allel programming models are based on a shared memory 
model, where data are globally accessible. This is the form
of the parallel random access machine (PRAM) [3] and the
related bulk-synchronous parallel (BSP) [4] model. Shared
memory parallelism allows sequential approaches to data
abstraction and conventional data structures to be employed, 
but it does not scale well with large numbers of cores. 
Access latency can vary significantly and unpredictably 
due to the physical distribution of data across a machine. 
This makes it difficult to exploit locality, which is essential 
for scaling a computation, and poses problems for barriers 
which are delayed by the slowest participant. Additionally,
concurrent accesses to shared data can incur latency
from collisions, and when those accesses are updates, behaviour
can become non-deterministic.

There are a number of issues related to the implementation 
of a shared memory system that pose further problems for 
this type of data abstraction. Mainstream parallel processors 
take the form of symmetric multi-processors (SMPs) and 
these have brought about a number of parallel programming 
approaches such as the Cilk [6] language, OpenMP [7] and
Intel's Threading Building Blocks (TBB) [8]. These employ
a multi-threaded execution model where a number of threads 
are managed by a scheduler, but problems can arise with pro- 
grams that combine parallel components. Performance can 
be affected significantly by threads competing for execution, 
causing unnecessary context-switches, and idling within a 
component due to a load imbalance, causing under-utilisation.
The effects of this are dependent on combinations of
program components and result in unpredictable execution
time, exacerbating non-deterministic behaviour. OpenCL [9]
is emerging into the mainstream and is designed to support 
the programming of heterogeneous systems, which are
typically composed of CPUs and GPGPUs. It uses a shared
memory model but exposes distinct address spaces, and in
order to compose components operating in different ones,
variables must be explicitly transferred between them [11].

Parallelism is now the primary means of sustaining growth
in computational performance [12] and the shared memory
model will continue to be useful. However, it looks certain
that future systems will involve large numbers of processors
and that shared memory alone will not be effective in delivering
performance on them. Therefore, it is necessary that parallel programming
models, as well as supporting shared memory approaches, 
also support composable representations of distributed data. 

This paper proposes a simple distributed programming
model that builds on the approach of the occam programming
language [13] with notations to control the distribution
of parallelism and a server construct that is active only in 
response to requests. Arrays of servers can be combined to 
construct distributed data structures, independently from the 
computational aspects of a program, providing access for 
shared or distributed styles of data-parallelism. This gives 
the programmer flexibility to employ the most appropriate 
data representation for the purposes of the program and 
scalability. Server-based data structures can be composed 
with similar scoping rules to conventional variable decla- 
rations to simplify the task of building scalable programs 
by allowing them to be composed in a modular way. With 
the provision of a small set of primitive capabilities in the 
hardware, the model can be compiled with a fixed allocation
of processors so that it operates efficiently and directly
on the hardware, without the use of dynamic allocation
mechanisms. The idea is similar to stack-based allocation
for subroutines in sequential machines.

The following specific contributions are made: 

1) A server construct that can be used to express compos- 
able representations of distributed data structures with 
arrays of server processes, for both shared memory 
and message passing distributed memory style parallel 
computations. 

2) An efficient implementation of distributed parallelism 
based on a compile-time allocation of processors. 

3) An implementation of server processes that allows 
many-to-one client connections to be established ef- 
ficiently and without deadlock. 

4) Demonstration of the proposed notations with three 
example programs that are characteristic of general- 
purpose applications and employ different styles of 
parallelism. 

The rest of this paper is organised as follows. Section II
reviews related work; Section III presents the proposed
programming model and notations in terms of a conceptual
machine model; Section IV describes the requirements of
a target architecture and the compilation scheme for it;
Section V discusses how several example programs that
require distinct styles of parallelism can be expressed in the
model; Section VI concludes.

II. Related work 

Distributed memory architectures are most common in 
high performance computing (HPC) systems and the Message
Passing Interface (MPI) [14] is the standard programming
approach. MPI provides features for the construction of
modular components such as libraries [15], with features to
name groups of processes and provide scoping for operations 
within them, but it does not allow a separation of data 
because of its SPMD (single program multiple data) model. 
The success of MPI can be attributed to its simple com- 
pilation and execution model, which provides predictable 
execution that allows programmers to make efficient use of
a machine. Other more dynamic languages push resource
allocation and management into runtime components that 
require significant overheads in execution time and storage, 
and result in less predictable execution of program compo- 
nents. 

Dynamic process creation was introduced in MPI-2, and 
in particular, a server construct, similar to the proposal in 
this paper, was introduced to address the need to support 
groups of reactive processes that accept connections from 
other groups [14, Sect. 10.4]. The problem with this is that
the location of processes is not known at compile-time. To
quote the specification directly: "Almost all of the complexity
in MPI client/server routines addresses the question 'how
does the client find out how to contact the server?'". This
issue also lies at the heart of this work, but the solution is
simplified by the choice of notations, the restrictions placed 
on them and support required in the architecture. 

Partitioned global address space (PGAS) languages such 
as UPC |[161, Chapel 03 and X10 flJD are based on globally 
accessible variables that are divided into logical segments 
to provide a clean composition of distributed data and 
computation. These segments have affinity with particular 
processes to provide a notion of locality for fast memory 
accesses, and global accesses are compiled into message 
passing communications. These languages include a range of 
distributed data types with high level notations for operating 
on them. Static distributions can be compiled into message 
passing programs, although it is not yet clear how efficient 
they are compared to manually crafted MPI equivalents, and 
as yet PGAS languages have not had widespread adoption. 

Charm [19] is another HPC-orientated language but takes
a different approach. Parallelism is expressed with arrays 
of objects and communication is performed with remote 
method calls. A runtime system is responsible for dy- 
namically mapping objects onto processors and scheduling 
communication. As is the case with dynamic processes 
in MPI, this requires all communications to be directed 
through proxy processes aware of object locations. Although 
Charm encourages modular development, it does not directly 
support composable representations of distributed data. 

Occam [13] and its descendant XC [20] are message
passing languages for distributed memory architectures.
Predictable execution is a key principle of both and this
is achieved primarily with a compile-time allocation of 
memory by prohibiting recursion and dynamically sized 
arrays. Implementations require the allocation of processors 
to be specified statically in a mapping file and program 
components cannot employ distributed parallelism internally. 
Developments as part of the occam 3 specification introduced
the concept of a server component [21, Chapt. 13].
The proposed notation builds on this with a distributed 
execution model and relaxed communication constraints. 



III. Proposal 

A. Architectural model 

The proposed programming model is based on a simple 
conceptual architecture where, to a first order approximation, 
there is an infinite array of processors. Each processor 
has a relatively small private memory, but the ability to 
communicate with any other processor via a network in 
a constant amount of time, independent of the processor 
locations. This is an idealised view held by the programmer 
to simplify programming. 

A realistic parallel machine can provide a good approx- 
imation to this with a fixed number of processors and 
a logarithmic-diameter, high-capacity network such as a 
Clos/fat tree [22] or hypercube [23]. Networks such as
meshes do not provide these properties and programs must 
be carefully mapped to preserve locality to obtain good 
performance. 

This model is analogous to the random access machine 
(RAM) model of computation [24], which models the essential
aspects of a conventional sequential computer. It consists
of a program that operates on an infinite capacity memory 
where accesses take a constant amount of time, independent 
of the address. In practical sequential computers, memory 
size is limited and access incurs a latency related to capacity, 
also by a logarithmic scaling. 

B. Notations 

The following is an informal description of the proposed 
language notations. An imperative block-structured syntax 
is used and the basic features of this are based on the occam
programming language [13]. It includes sequential and
parallel composition, replication and channel-based commu- 
nication and provides a platform for the main contributions 
of this paper: notations to express local and distributed 
parallelism and a server construct. Local parallelism relates 
to concurrent threads that access a shared memory and dis- 
tributed relates to distinct memories. Diagrams are included 
throughout to provide an intuition for the programming 
model and behaviour of the notations in isolation and in 
composition. 

1) Composition: A program is built as a hierarchical 
collection of processes that can be composed in sequence 
or in parallel. Sequential composition is denoted by the ' ; ' 
separator and causes a set of processes to be executed one 
after another. If P, Q and R are processes, then the process 

P ; Q ; R 

is executed by running P, Q and then R. Sequential com- 
position can be replicated to produce a number of similar 
processes executed in sequence. If P(i) is a process, then 
the process 

seq i=0 for n do P(i)

is equivalent when n = 4 to

P(0) ; P(1) ; P(2) ; P(3).

Parallel composition causes the component processes to 
start simultaneously and the execution can be directed to 
occur locally or distributed over an array of processors. 
Local parallel execution is denoted by the ' | ' separator. The 
process 

P | Q | R

causes the component processes P, Q and R to start
simultaneously on a processor $p_k$, where k is the identifier
(ID) of a processor, and it terminates when all component
processes have terminated.

[Diagram: processes P, Q and R running on processor $p_k$.]

Distributed parallel composition is denoted by the '&' sep- 
arator. The process 

P & Q & R 

is equivalent to the above local composition, except that
P, Q and R start simultaneously on different processors
$p_k$, $p_{k+1}$, $p_{k+2}$.

[Diagram: P, Q and R running on processors $p_k$, $p_{k+1}$ and $p_{k+2}$ respectively.]

Distributed composition can be replicated to produce a 
number of similar parallel processes and can be thought of 
as declaring a process array. The process 

par i=0 for n do P(i)

is equivalent when n = 4 to

P(0) & P(1) & P(2) & P(3).

Processes in distributed composition are allocated on 
consecutively numbered processors as this simplifies the task 
of establishing communication channels with them because 
they can be addressed with a base and offset. This property of
the notation also allows correspondences to be established 
between different arrays. For example, replication, combined 
with local composition can be used to layer arrays of parallel 
processes on the same array of processors. The process 

par i=0 for m do P(i) |
par i=0 for n do Q(i)

for m = n causes each processor $p_k, p_{k+1}, \ldots, p_{k+n-1}$ to
execute P(x) and Q(x) for some x.

[Diagram: process arrays P and Q layered over processors $p_k$, $p_{k+1}$, ..., $p_{k+n-1}$.]
Figure 1. A server process, serving a set of clients.

The result of this is a direct correspondence between P and 
Q with the same index and any communication between 
them will be performed locally. For m ≠ n, one array will
be larger and allocated over more processors. In contrast, 
the distributed composition of the same replicators 

par i=0 for m do P(i) &
par i=0 for n do Q(i)

allocates the two process arrays on disjoint sets of processors,
the first beginning at $p_k$ and the second at $p_l$, for l > k + n.

2) Servers: The server notation provides a simple way of 
separating a representation of data from the computations 
which act on it and can be used in conjunction with 
replicators to implement distributed structures that can be 
accessed concurrently. Furthermore, it allows both shared 
and distributed memory style parallelism to be expressed in 
a similar way. This is a significant capability as it allows a 
programmer to move easily between them. 

A server is a special kind of process that is only active in 
response to clients. The interface to a server is a set of calls, 
which behave in the same way as conventional procedure 
calls, except the parameters and results are transferred to 
and from the server so that execution of the call occurs 
at the server. Fig. 1 illustrates a single server with a set of
clients. This mechanism is known generally as a remote procedure
call (RPC) [25] and is attractive because it provides
clean semantics, hiding the underlying communication, and 
provides the ability to move easily between the local and 
remote forms of a call. 

A server definition specifies a set of potential calls and 
provides responses to them. Its only action while running 
is to repeatedly serve calls and it terminates when its scope 
terminates. Local state can be initialised by a special initial- 
isation process and a corresponding termination process can 
be used to finalise the server upon termination. In
object-orientated programming, this relates directly to the concept
of an object with a constructor that takes an initial value and
methods that operate on the private attributes.

As an example, Process 1 defines a server to provide
access to an array. When it initialises, each element of the 
array is set to an initial value, specified as a parameter 
(init), and when the server is running, calls can be made 
to read or write to specific locations. 

Process 1

server Store (val init)
  interface (
    call read (val i, var v),
    call write (val i, val v))
{ var data[N];
  initial
  { var i;
    seq i=0 for N do
      data[i] := init
  }
  accept
  { read ? (val i, var v)
      v := data[i]
    write ? (val i, val v)
      data[i] := v
  }
  final {}
}



The following specifies an instance of the Store server 
with the name s, for use with an anonymous client process 
that executes in parallel and makes calls to write to each 
store location. 

server s is Store(0) &
  seq i=0 for n do s.write(i, i)

Servers can be replicated with a similar notation to a 
conventional array declaration. For example the server array 

server s is Store (0) [n] 

creates n instances of the store server, with each initialised 
by the same parameters, in this case 0. A call to a particular 
server is made by specifying a server with an array subscript 
such as s[0].
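
As an illustration only, the behaviour of a server and its client can be imitated in Python with a thread and message queues. This is a minimal sketch under stated assumptions, not the implementation described in this paper: the class name StoreServer, the call IDs and the explicit stop call (standing in for the end of the server's scope) are all hypothetical names introduced for the example.

```python
import queue
import threading


class StoreServer(threading.Thread):
    """Hypothetical stand-in for `server Store (val init)`."""

    READ, WRITE, STOP = range(3)  # illustrative call IDs, unique to this server

    def __init__(self, size, init):
        super().__init__()
        self.requests = queue.Queue()  # many-to-one request channel
        self.data = [init] * size      # 'initial' section: set every element

    def run(self):
        while True:                    # 'accept' section: serve one call at a time
            call, args, reply = self.requests.get()
            if call == self.READ:
                (i,) = args
                reply.put(self.data[i])
            elif call == self.WRITE:
                i, v = args
                self.data[i] = v
                reply.put(None)
            else:                      # scope ended; a 'final' section would run here
                reply.put(None)
                return

    def _call(self, call, *args):
        reply = queue.Queue(maxsize=1)  # private reply channel for this call
        self.requests.put((call, args, reply))
        return reply.get()              # block until the server responds

    def read(self, i):
        return self._call(self.READ, i)

    def write(self, i, v):
        self._call(self.WRITE, i, v)

    def stop(self):
        self._call(self.STOP)
        self.join()


def demo(n=4):
    s = StoreServer(size=n, init=0)     # server s is Store(0)
    s.start()
    for i in range(n):                  # seq i=0 for n do s.write(i, i)
        s.write(i, i)
    values = [s.read(i) for i in range(n)]
    s.stop()
    return values
```

Because each call waits for its reply before returning, the client sees the same behaviour as a local procedure call, which is the property the server notation relies on. A server array could be imitated by a list of such objects indexed by subscript.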



[Diagram: processors $p_k$, $p_{k+1}$, ..., $p_{k+n-1}$.]

C. Expressing data-parallelism 

With the proposed notations for controlling distribution 
and creating arrays of servers that can be accessed by 
collections of clients, it is possible to express both shared 
and distributed memory forms of data-parallel computations. 
Figure 2. Illustration of the layout and structure of the shared memory
implementation in Process 2. The storage is distributed over a disjoint array
of processors, hence l > k + n.

1) Shared memory: A shared memory, distributed over 
an array of processors, can be expressed with two server 
arrays, one to act as a store and the other to provide an access 
abstraction. For example, Process 2 provides an access server
(Access, which has the same interface as Store) to each
of the m client processes. The access and client processes
are layered over the same processors so interaction between
these is local. Each access server holds a reference to the
array of n storage servers, takes read and write requests
from the client and performs them over this array. Fig. 2
illustrates the layout and structure of this.

Process 2

server s is Store(0)[n] &
{ server a is Access(s)[m] |
  par i=0 for m do
  { ... ; a[i].write(·, ·) ; ... }
}

To avoid uneven distribution of accesses and load on 
particular servers, which would result in increased access 
latency, the access servers could select storage servers by 
some appropriate hash function. This is the form of a 
PRAM and the memory system of a BSP machine. For 
the most general concurrent-read concurrent-write (CRCW) 
form of memory, read combining could also be used to avoid 
excessive access collisions [26].
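
One way an access server might spread addresses over the storage servers is sketched below. It is an assumption-laden example rather than the scheme used in this paper: the function name make_address_map, the affine permutation and its default constants are hypothetical, and a production system would choose a hash with stronger mixing. The sketch only guarantees that every global address maps to a distinct (server, slot) pair and that every storage server receives the same number of addresses.

```python
from math import gcd


def make_address_map(n_servers, store_size, multiplier=7, offset=3):
    """Return a function mapping a global address to (server, slot).

    The address space has n_servers * store_size locations. An affine map
    with a multiplier coprime to that size permutes the addresses before
    they are divided among the storage servers.
    """
    total = n_servers * store_size
    if gcd(multiplier, total) != 1:
        raise ValueError("multiplier must be coprime to the address-space size")

    def locate(addr):
        if not 0 <= addr < total:
            raise IndexError(addr)
        h = (multiplier * addr + offset) % total  # permutation of 0..total-1
        return divmod(h, store_size)              # (storage server, slot within it)

    return locate
```

An Access server holding a reference to the store array s would then serve a read of global address g by computing (k, j) = locate(g) and calling read on s[k] with index j.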

2) Distributed memory: A distributed representation of 
data can be expressed in a similar way, without an ac- 
cess abstraction and with the server and client processes 
distributed over the same set of processors. Process 3 is
similar to Process 2 except clients are co-located with a
storage server and access it directly. Since there is a local
correspondence between servers and clients, this call will
not incur any overhead due to the underlying interconnection
network.

Process 3

server s is Store(0)[n] |
par i=0 for n do
{ ... ; s[i].write(·, ·) ; ... }
Figure 3. The layout and structure of the distributed memory Process 3.
Servers are situated with clients for fast access.

Data stored with other server or client processes could 
be accessed with server calls, but race conditions can arise 
from concurrent access to shared data and synchronisation 
is required to avoid this. Instead, synchronised message 
passing communication avoids these issues and is widely 
used for scalable algorithms, typically in large systems such 
as supercomputers. In general, simple scalable structures 
such as pipelines, grids and trees are used [27], which are
easily expressed in occam and hence are composable with
server-based representations of data. This is demonstrated
with the matrix multiplication example in Section V-A.

Shared and distributed memory forms of data-parallelism 
lend themselves to different applications and the ability of 
the proposed programming model to cleanly support both is 
significant. It provides the programmer with the flexibility 
to employ a notation that best suits a given application. 

IV. Compilation 

The choice of notations and their restrictions allow for an 
efficient implementation. This does however depend on the 
provision of certain functionality to support the execution of 
a collection of communicating parallel processes and, in par- 
ticular, many-to-one patterns of communication. These are 
described first, as an architectural target for the compilation 
scheme. 

A. Architectural target 

The following defines the basic requirements of the pro- 
posed language notations, independent of a specific hard- 
ware or software implementation. 

1) Processor addressing. Each processor in a system of 
p processors has a unique integer ID in the range 0 to
p-1 identifying it.

2) Multi-threading. A processor has the ability to sup- 
port multiple concurrent threads of execution and any 
thread has the ability to create additional threads. 

3) Point-to-point communication. Any two threads can 
communicate by passing messages over bidirectional 
point-to-point channels. A channel is composed of two 
channel ends that are each local to a thread. A channel 
end has an ID that combines a local unique ID with 
the processor's ID so that it can be uniquely identified 
in a system. Before a process p can send a message
to another process q, it must set the destination of a
local channel end to the ID of the channel end that
q is using to receive messages. It is not necessary for
q to specify p as the source unless it sends a message
in return to p. All messages are delivered in-order.
4) Many-to-one connections. A channel end may be 
specified as a destination by multiple senders. In this 
case, a sender must be able to establish a connection to 
ensure other messages from different senders cannot 
be delivered and interrupt a communication sequence. 
These requirements are based on the INMOS transputer [28]
and related XMOS XS1 [29] architectures, which
provide low-level or hardware support for them. Other
larger-scale message passing architectures such as BlueGene/L [30]
and BlueGene/Q [31] realise similar concepts
in their software point-to-point messaging layer.

B. Scheme 

1) Compile-time process allocation: As the size of all
process arrays (both replicated processes and servers) can
be determined at compile-time, it is possible to determine a
complete static schedule for the allocation of processes to
processors. This maps process arrays to contiguous blocks
of processors and logically adjacent processes to the same
processor. For example, the runtime use (and reuse) of
processors by Process 4 is illustrated in Fig. 4. This dynamic
behaviour is analogous to the allocation of stack frames in
memory for procedure calls.

Process 4

server a is A(...)[n] &
{ P; Q; R;
  { server b is B(...)[m] |
    server c is C(...)[m] |
    { X; Y }
  }
}
Allocation is performed by initialising a base processor
b to be ID 0. A process is then assigned to processor b and
for each distributed parallel composition that it contains,
the n component processes of it are assigned to processors
b, b+1, ..., b+n-1. The allocation is then applied
recursively to each component process with b set to b+n.
Parallel composition with local distribution is compiled
into thread-based execution with instruction sequences to
perform initialisation, start execution and synchronise before
termination.
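
The allocation can be pictured as a recursive walk over the process structure. The following sketch is one possible reading of the description above and not the compiler's actual algorithm: the Proc tree representation and the rules for each node kind are assumptions. Distributed components are placed on consecutive blocks, while sequential and local components share a base, so that later processes reuse processors in the manner of stack frames.

```python
from dataclasses import dataclass, field


@dataclass
class Proc:
    """A node in a hypothetical process tree."""
    kind: str      # "leaf", "seq" (;), "local" (|) or "dist" (&)
    name: str = ""
    count: int = 1  # size of a leaf process or server array
    parts: list = field(default_factory=list)


def allocate(node, base=0, table=None):
    """Assign processor ranges; return (processors spanned, allocation table)."""
    if table is None:
        table = {}
    if node.kind == "leaf":
        table[node.name] = range(base, base + node.count)
        return node.count, table
    extent = 0
    for part in node.parts:
        if node.kind == "dist":
            # Each distributed component starts after the previous one.
            used, _ = allocate(part, base + extent, table)
            extent += used
        else:
            # Sequential and local components share the same base processor.
            used, _ = allocate(part, base, table)
            extent = max(extent, used)
    return extent, table


def process4(n, m):
    """The structure of Process 4, written as a tree."""
    inner = Proc("local", parts=[
        Proc("leaf", "b", m),
        Proc("leaf", "c", m),
        Proc("seq", parts=[Proc("leaf", "X"), Proc("leaf", "Y")]),
    ])
    body = Proc("seq", parts=[Proc("leaf", "P"), Proc("leaf", "Q"),
                              Proc("leaf", "R"), inner])
    return Proc("dist", parts=[Proc("leaf", "a", n), body])
```

Under these assumptions, allocate(process4(4, 2)) places the server array a on processors 0 to 3, runs P, Q and R in turn on processor 4, and then layers b and c over processors 4 and 5, for a total of six processors.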

2) Server communication: A single server is addressed
by its processor ID and local channel end ID. These can
be packed into a single word and passed as a reference.
An array of servers is addressed by a base processor ID,
a common local channel end ID and an offset. This allows
a normal server call s.c(...) or subscripted call s[i].c(...),
where s is the server reference, to exactly specify a particular
server.
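
The addressing scheme can be made concrete with a few bit operations. The field widths below are assumptions chosen for the example (the paper does not fix them), and the function names are hypothetical. The sketch also assumes one server of the array per processor, on consecutively numbered processors, as in the allocation scheme above.

```python
PROC_BITS = 16  # assumed width of the processor ID field
CHAN_BITS = 16  # assumed width of the local channel end ID field


def pack_ref(proc_id, chan_id):
    """Pack a processor ID and a local channel end ID into one word."""
    if not (0 <= proc_id < (1 << PROC_BITS) and 0 <= chan_id < (1 << CHAN_BITS)):
        raise ValueError("ID out of range for the assumed field widths")
    return (proc_id << CHAN_BITS) | chan_id


def unpack_ref(ref):
    """Recover (processor ID, channel end ID) from a packed reference."""
    return ref >> CHAN_BITS, ref & ((1 << CHAN_BITS) - 1)


def element_ref(base_ref, offset):
    """Reference to s[offset], given the reference to the first server s[0]."""
    proc_id, chan_id = unpack_ref(base_ref)
    return pack_ref(proc_id + offset, chan_id)  # same channel end, later processor
```

With this layout a subscripted call s[i].c(...) needs only an addition on the processor field to find its destination, which is why a common channel end ID across the array is useful.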
Figure 4. Illustration of the runtime use of processors according to the
compile-time allocation for Process 4. 'Step' relates to the sequence of
execution.



The set of calls for a server is implemented with this
single channel and each call is assigned an ID unique to the
server. Let $a_0, a_1, \ldots, a_{n-1}$ be a set of actual parameters
and P be a process making a call c to a server s of the
form

$s.c(a_0, a_1, \ldots, a_{n-1})$.

Then, for a channel end c local to P, it is compiled as the
sequence:

1) set the destination of c to be the channel end of s;

2) connect to s;

3) send the channel end ID for c;

4) send the call ID for s;

5) send each actual parameter $a_i$;

6) receive each referenced actual $a'_i$ and set $a_i \leftarrow a'_i$;

7) disconnect from s.

Once the client has connected to the server, it sends the 
identity of its channel end so that the server can make the 
necessary corresponding responses to the above sequence. 
By establishing a connection with the server, calls made by 
other clients will block until the server becomes free. In this 
sense, server calls are atomic. 
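
The client-side sequence can be written as a small code generator. This is only a sketch of the seven steps listed above: the operation names ("setd", "connect", "send", "recv", "disconnect") and the tuple encoding are hypothetical and do not correspond to any particular instruction set.

```python
def compile_call(server_ref, chan_end, call_id, params):
    """Emit the client-side operations for one server call.

    params is a list of (mode, name) pairs, where mode is "val" for a value
    parameter or "var" for a referenced parameter that receives a result.
    """
    ops = [
        ("setd", chan_end, server_ref),             # 1) set destination to the server
        ("connect", chan_end),                      # 2) connect to the server
        ("send", chan_end, ("chanend", chan_end)),  # 3) send our channel end ID
        ("send", chan_end, ("call", call_id)),      # 4) send the call ID
    ]
    # 5) send each actual parameter
    ops += [("send", chan_end, ("param", name)) for _, name in params]
    # 6) receive each referenced actual and assign it back
    ops += [("recv", chan_end, name) for mode, name in params if mode == "var"]
    ops.append(("disconnect", chan_end))            # 7) disconnect from the server
    return ops
```

For the read call of Process 1, compile_call("s", "c0", 0, [("val", "i"), ("var", "v")]) yields eight operations, of which exactly one is a receive, for the referenced parameter v.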

A key issue in the implementation of servers is to guaran- 
tee that calls always complete. In a simple implementation, 
there is potential for deadlock to occur. This is caused by a 
situation where multiple clients are waiting for a busy server. 
If to service a call the server must perform communication, 
it might not be possible to establish a route in the network 
due to waiting requests holding network resources. To avoid 
this, a server must always be able to consume requests so 
that a call is always guaranteed to complete. In practice, 
the number of clients accessing any one server is likely 
to be small and a small queue, with a size logarithmically 
related to the number of clients, will probably suffice for 
most programs. To avoid deadlock when the queue becomes 
full, clients can reattempt to connect, at a rate according 
to an exponential backoff, similar to the Ethernet protocol. 
Alternatively, two separate physical or logically partitioned
networks could be used, one for server calls and the other
for general communication. This way, queued calls would 
never interfere with any external communication a server 
makes. 
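
The reconnection strategy can be sketched as truncated binary exponential backoff, the scheme used by classic Ethernet. The cap on the exponent, the attempt limit and the function names are assumptions made for illustration, as is the try_connect callback that stands in for an attempt to enter the server's queue.

```python
import random


def backoff_slots(attempt, rng, max_exponent=10):
    """Slots to wait after `attempt` failed connection attempts."""
    window = 1 << min(attempt, max_exponent)  # window doubles per failure, then caps
    return rng.randrange(window)              # uniform in [0, window)


def connect_with_backoff(try_connect, rng, max_attempts=16):
    """Retry try_connect() with backoff; return slots waited, or None on giving up."""
    waited = 0
    for attempt in range(max_attempts):
        if try_connect():
            return waited
        waited += backoff_slots(attempt + 1, rng)
    return None
```

Randomising the delay keeps clients that were rejected together from retrying together, so a full queue drains rather than being hit by the same burst again.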

3) Process distribution: The processor allocation is 
known for each process at compile-time. At run time, the 
instruction sequence constituting a process must be available 
at a processor that is scheduled to execute it. There are two 
approaches that can be taken to this. With static distribution, 
compilation would produce a set of p binary images for 
a p processor system, with each binary containing all the 
processes that will be executed by the given processor. This 
requires each processor to have a large enough memory to 
store every process that it will execute over the course of 
a program, in addition to the memory requirements of each 
process. For large p, the size of the binary package could 
also be significant. With dynamic distribution, processes are 
loaded onto processors at runtime, before they are executed. 
Compilation produces two binaries, a master image that 
contains all the program and a slave image that waits to 
receive processes to execute. The benefit of this is a smaller 
per-processor memory requirement and binary package in- 
dependent of the size of a system. Dynamic distribution can 
be made efficient by employing recursion [32].

In addition to a component parallel process being avail- 
able at a processor, execution on a remote processor also 
requires the complete lexical environment, i.e. all of the 
variables it uses that are external to its scope. This can be 
determined at compile-time and message passing sequences 
generated both to supply these variables and to receive any 
updates to them when the process terminates. 

V. Examples 

This section presents three example programs to demon- 
strate the proposed notations: matrix multiplication, a ray 
tracer and a compiler. The choice of these is based on 
general-purpose applications that require different styles of 
parallelism. 

A. Matrix multiplication 

Matrix multiplication is widely used in scientific pro- 
grams. It is inherently data-parallel and the most scalable 
parallel formulations employ message passing structures. 
Cannon's algorithm [33] is a simple distributed algorithm
that is structured as a 2D grid. 

For an n x n grid of processes, this can be expressed
as Process 5. It takes three arrays of sub-matrix servers
(a, b and c) as parameters that represent the input and
result matrices. The subroutine proceeds by creating a 2D
array of nodes with each node connected by channels in
four directions and assigned a single sub-matrix server.
The node process performs computations on the local
sub-matrices and sends and receives sub-matrices in each direction
according to the algorithm. This subroutine encapsulates the
algorithm, separating the message passing implementation
from the distributed representation of the matrices. The
layout of this is illustrated in Fig. 5b.
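
For readers unfamiliar with Cannon's algorithm, the data movement can be simulated sequentially. The sketch below is not the node process of this paper: it assumes one scalar element per grid node rather than a sub-matrix, and it replaces channel communication with list rotations. The function name cannon_multiply is hypothetical.

```python
def cannon_multiply(A, B):
    """Multiply two n x n matrices by simulating Cannon's algorithm."""
    n = len(A)
    # Initial skew: row i of A rotates left by i, column j of B rotates up by j,
    # so node (i, j) starts with A[i][k] and B[k][j] for k = (i + j) mod n.
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        # Every node multiplies its local pair and accumulates the result.
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]
        # Each node passes its A value left and its B value up, with wraparound.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c
```

After n steps each node has seen every matching pair of elements exactly once, using only nearest-neighbour communication, which is what makes the grid structure scalable.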

A subroutine like this will most likely be employed as a 
component of a more complex program, but even included 
in a program that does nothing else, it requires additional 
components for the initialisation of the input matrices and 
a way to output the result. A simple way to do this is 
to directly read or write values to the distributed matrices 
in a global initialisation phase. Process 6, for example,
iterates over each sub-matrix and performs initialisation
directly. A similar process could be used to output
the result. A complete minimal program to perform matrix
multiplication could then be composed as Process 7, where
the three matrices are declared as server arrays with a
layered distribution. The client process sequentially loads
the input matrices, performs the multiplication and outputs
the result. Fig. 5 illustrates the distribution of processes and
communication patterns for the load and multiply phases of 
the algorithm. 
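
The global load and output phases described above can be sketched in Python as follows. This is only an analogy under stated assumptions: `SubMatrixServer` is a hypothetical stand-in for one element of a server array, with assumed `read` and `write` calls, and `init` is an assumed function giving the value at a global (row, column) position.

```python
class SubMatrixServer:
    """Hypothetical stand-in for one server: owns a single m x m block."""

    def __init__(self, m):
        self.m = m
        self.data = [[0] * m for _ in range(m)]

    def write(self, r, c, value):
        self.data[r][c] = value

    def read(self, r, c):
        return self.data[r][c]

def load_sub_matrix(server, i, j, init):
    """Fill block (i, j) from init, translating local to global indices."""
    m = server.m
    for r in range(m):
        for c in range(m):
            server.write(r, c, init(i * m + r, j * m + c))

def load_matrix(servers, n, init):
    """Sequentially visit every server in the n x n array."""
    for i in range(n):
        for j in range(n):
            load_sub_matrix(servers[i][j], i, j, init)

def output(servers, n):
    """Gather all blocks back into a single list of rows."""
    m = servers[0][0].m
    return [[servers[i][j].read(r, c) for j in range(n) for c in range(m)]
            for i in range(n) for r in range(m)]
```

The client only ever uses the read and write calls, so the way each block is stored stays hidden behind the server.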

Process 5 

proc multiply( 
    server Matrix[n][n] a, 
    server Matrix[n][n] b, 
    server Matrix[n][n] c, val n) is 
{ chan[n][n+1] h; 
  chan[n][n+1] v; 
  var x, y; 
  par y=0 for n do 
    par x=0 for n do 
      node(a[x][y], b[x][y], c[x][y], 
           v[x][y], v[x][(y+1) rem n], 
           h[y][x], h[y][(x+1) rem n]) 
} 

Process 6 

proc loadMatrix( 
    server Matrix[n][n] m, val n) is 
{ var i, j; 
  seq i=0 for n do 
    seq j=0 for n do 
      loadSubMatrix(m[i][j]) 
} 

Process 7 

server a is Matrix(M, M)[n][n] 
server b is Matrix(M, M)[n][n] 
server c is Matrix(M, M)[n][n] 
{ loadMatrix(a, n); 
  loadMatrix(b, n); 
  multiply(a, b, c, n); 
  output(c, n) 
} 

(a) Load phase (b) Multiply phase 

Figure 5. Process distribution and communication structures for successive phases of the matrix multiply program. Each employs different communication 
structures; loading performs a sequence of calls to the server array and the multiplication algorithm performs only local server accesses, but with grid-based 
message passing communication. 



B. Ray tracing 

Ray tracing is a technique for generating realistic 2D 
images from 3D scenes. It is highly parallelisable as the 
calculation of each pixel, based on intersecting a ray with 
a world model, can be performed independently. When 
the world model is small enough to fit into the memory 
of a single processor, a parallel scheme requires only the 
communication of work and results. When it is larger than 
a single memory, it has to be distributed and accessible by 
all processes calculating ray intersections. 

A distributed world model has a simple form with the 
same structure as the shared memory in Process 2. Work is 
distributed in a task farm structure, by a master process 
to a collection of worker processes. This is outlined in 
Process 8 and illustrated in Fig. 6. Process 8 includes 
separate initialisation and output phases, similar to the ones 
described for the matrix multiply program (Process 6). 
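
The task farm structure can be illustrated with a small Python sketch using threads. The names `task_farm` and `shade` are hypothetical, a shared queue plays the role of the master handing out work, and `shade` is only a placeholder for a real ray-object intersection calculation.

```python
import queue
import threading

def shade(pixel):
    """Placeholder for tracing one ray: 1 if the pixel lies on a disc."""
    x, y = pixel
    return 1 if (x - 4) ** 2 + (y - 4) ** 2 <= 4 else 0

def task_farm(tasks, work, num_workers):
    """Hand out tasks to a fixed set of workers and collect the results."""
    pending = queue.Queue()
    for task in tasks:
        pending.put(task)
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                task = pending.get_nowait()  # request the next piece of work
            except queue.Empty:
                return                       # no work left, so terminate
            value = work(task)
            with lock:                       # return the result to the master
                results[task] = value

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results

pixels = [(x, y) for y in range(8) for x in range(8)]
image = task_farm(pixels, shade, num_workers=4)
```

Because each pixel is computed independently, the order in which workers take tasks does not affect the final image.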

Process 8 

server master is Master() & 
server objs is ObjectStore()[n] & 
{ server access is WorldAccess(objs)[m] 
  { var i; 
    loadWorldModel(access); 
    par i=0 for m do 
      worker(master, access); 
    output(master) 
  } 
} 



Figure 6. Structure of the parallel ray tracer where a world model is 
provided by an array of servers and accessed concurrently by a collection 
of workers. These are delegated work by a single master process. The world 
model is disjointedly distributed (I > k + n) but it could also be layered 
with the workers (I = k). 

Each of the m workers can access the world model 
(distributed over n servers) via a specific server and will 
do so frequently during the computation. In addition to 
optimising the implementation of shared memory, it is 
necessary to reduce the number and latency of accesses to 
obtain a scalable ray tracing algorithm [34]. To do this, each 
access server can maintain a summary structure, usually a 
bounding volume hierarchy (BVH), to minimise ray-object 
intersection tests; it can also cache objects. With existing 
parallel programming models, this functionality would be 
implemented as part of the worker, but in Process 8 it is 
encapsulated in the representation of the data, allowing a 
simple world model interface to be presented to the workers. 
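
The idea of hiding caching behind the world model interface can be sketched in Python as below. The class names echo Process 8, but the methods (`put`, `get`), the placement of objects by identifier modulo n, and the unbounded cache are all assumptions made for illustration; a BVH is omitted.

```python
class ObjectStore:
    """Hypothetical store: objects are spread over n shards by identifier."""

    def __init__(self, n):
        self.shards = [dict() for _ in range(n)]
        self.requests = 0  # counts accesses that reach the store

    def put(self, obj_id, obj):
        self.shards[obj_id % len(self.shards)][obj_id] = obj

    def get(self, obj_id):
        self.requests += 1
        return self.shards[obj_id % len(self.shards)][obj_id]

class WorldAccess:
    """Hypothetical access server: same get() interface, plus a local cache."""

    def __init__(self, store):
        self.store = store
        self.cache = {}

    def get(self, obj_id):
        # Only go to the distributed store on a cache miss.
        if obj_id not in self.cache:
            self.cache[obj_id] = self.store.get(obj_id)
        return self.cache[obj_id]

store = ObjectStore(4)
for k in range(10):
    store.put(k, ("sphere", k))
access = WorldAccess(store)
first = access.get(7)
second = access.get(7)   # served from the cache
third = access.get(3)
```

A worker calls only `access.get`, so caching (or any other summary structure) can be added or changed without touching the worker code.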

C. Compiler 

Compilers are complex programs that employ many 
different algorithmic techniques and data structures. This 
makes them a canonical example of a general-purpose piece 
of software and a non-trivial test case for mapping realistic 
sequential applications to a parallel architecture. For this 
reason, there has been little work on parallel compilation, 
although there are opportunities for it, particularly during 
the optimisation and code generation phases [35]. In particular, 
many optimisations can be applied locally at an expression, 
statement, block or procedure level, and hence may be 
performed independently and in parallel over different parts 
of a parse tree or intermediate representation. 
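
As a small illustration of a local optimisation applied independently to different parts of a program, the Python sketch below folds constants in each procedure's expression tree using a pool of workers. The tuple representation of expressions and the names `fold` and `optimise` are assumptions for the example, not part of the compiler described here.

```python
from concurrent.futures import ThreadPoolExecutor
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold(expr):
    """Constant-fold an expression given as (op, left, right) tuples."""
    if not isinstance(expr, tuple):
        return expr  # a literal or a variable name
    op, left, right = expr
    left, right = fold(left), fold(right)
    if isinstance(left, int) and isinstance(right, int):
        return OPS[op](left, right)
    return (op, left, right)

def optimise(procedures, workers=4):
    """Fold each procedure body independently, one task per procedure."""
    names = list(procedures)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        folded = list(pool.map(fold, (procedures[name] for name in names)))
    return dict(zip(names, folded))

procs = {
    "f": ("+", 1, ("*", 2, 3)),
    "g": ("*", "x", ("-", 5, 2)),
}
optimised = optimise(procs)
```

Each call to `fold` touches only its own subtree, which is what allows the per-procedure tasks to proceed without coordination.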

The structure of a simple compiler is given in Process 9. 

Process 9 

server store is TreeStore()[n] & 
server tree is TreeAccess(store)[m] & 
server symbols is Table() & 
{ parse(tree[0], symbols); 
  semantics(tree[0], symbols, m); 
  optimise(tree, symbols); 
  { server store is BufStore()[I] & 
    server buffer is BufAccess(store) 
    generateInsts(tree[0], buffer); 
  } 
} 



Two server arrays, store and tree, provide a concurrently 
accessible parse tree, using the same principle as Process 8. 
Initially, parsing and semantic analysis phases operate 
sequentially on the parse tree, using a single access server. 
Local optimisations, as part of the optimise subroutine, 
can be performed in parallel on the parse tree and this will 
also require concurrent access to the symbol table. Finally, 
instructions are output sequentially to a distributed buffer. 
This buffer is declared in a separate scope to demonstrate 
that it could be included as part of the generateInsts 
subroutine. 

VI. Conclusion 

This paper proposes a simple programming model for 
expressing scalable parallel programs. A server construct 
can be used in combination with notations for expressing 
local and distributed parallelism to build abstractions for 
distributed data structures with both shared and distributed 
access structures. This gives the programmer the flexibility 
to move between shared and distributed forms of data 
parallelism, depending on the structure of the program and 
scalability requirements. Server-based data structures can 
be composed with other program components in a similar 
way to conventional variable declarations and have similar 
scoping rules. This allows them to be operated on by 
sequences of potentially parallel subroutines, simplifying the 
task of developing a complex parallel program. 

The distribution model allows a compile-time allocation 
of processing resources, to produce a static schedule. This 
provides efficient runtime performance and predictable 
timing, which are essential for building programs that scale 
to large numbers of cores. The compilation scheme 
requires support from the architecture, in particular to 
provide bounded, low-latency communication, to support the 
distribution model and general patterns of communication 
between program components and servers, and in message 
passing structures such as pipelines, grids and trees. 

The example programs demonstrate how the proposed 
notations can be used to compose computational components 
that require varied forms of parallelism with distributed data 
structures, in a clear and concise way. 